62 research outputs found

    Design and implementation of parallel video encoding strategies using divisible load analysis

    The processing time needed for motion estimation usually accounts for a significant part of the overall processing time of a video encoder. To improve video encoding speed, reducing the execution time of the motion estimation process is essential. Parallel implementation of video encoding systems, using either a software or a hardware approach, has attracted much attention in the area of real-time video coding. In this paper, we attempt to implement a video encoder on a bus network. For such a parallel system, the key concern is partitioning and balancing the computational load among the processors such that the overall processing time of the video encoder is minimized. Using the divisible load theory (DLT) paradigm, we develop a strip-wise load partitioning/balancing scheme, a load distribution strategy, and two implementation strategies to exploit the data parallelism inherent in the video encoding process. The striking feature of our design is that both the granularity of the load partitions and all the associated overheads incurred during the parallel video encoding process can be explicitly considered. This significantly contributes to the minimization of the overall processing time of the video encoder. Extensive experimental studies are carried out to test the effectiveness of the proposed strategies. The performance of the parallel video encoder is quantified using two metrics, speedup and performance gain. The experimental results show that our strategies are effective in exploiting the parallelism inherent in the video encoding process and provide theoretical insight into how to analytically quantify and minimize the overall processing time of a parallel system. The proposed strategies can be easily extended and applied to improve other existing parallel systems.
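
    As a rough illustration of the kind of load balancing the abstract above describes, the sketch below applies the textbook DLT recursion for a bus network with sequential load distribution and maps the resulting fractions onto strips of macroblock rows. It is not the paper's scheme: the function names, the parameters `w` (inverse processor speeds), `z` (inverse bus speed), `t_cp`, `t_cm`, and the 36-row frame are all assumptions made for illustration, and the overheads the paper models explicitly are ignored here.

```python
import math
from typing import List


def bus_load_fractions(w: List[float], z: float,
                       t_cp: float = 1.0, t_cm: float = 1.0) -> List[float]:
    """Load fractions that make all processors on a bus finish together.

    Assumes (hypothetically) that the load is sent to processors one after
    another and that each processor computes only after receiving its share.
    Equal finish times give: a[i] * w[i] * t_cp = a[i+1] * (z * t_cm + w[i+1] * t_cp).
    """
    if not w:
        raise ValueError("at least one processor is required")
    ratios = [1.0]
    for i in range(1, len(w)):
        ratios.append(ratios[-1] * (w[i - 1] * t_cp) / (z * t_cm + w[i] * t_cp))
    total = sum(ratios)
    return [r / total for r in ratios]  # normalise so the fractions sum to 1


def finish_times(alpha: List[float], w: List[float], z: float,
                 t_cp: float = 1.0, t_cm: float = 1.0) -> List[float]:
    """Finish time of each processor under sequential distribution."""
    times, sent = [], 0.0
    for a, wi in zip(alpha, w):
        sent += a * z * t_cm              # bus is busy until this share arrives
        times.append(sent + a * wi * t_cp)
    return times


def strips_from_fractions(alpha: List[float], rows: int) -> List[int]:
    """Round fractions to whole macroblock rows (largest-remainder rounding)."""
    raw = [a * rows for a in alpha]
    strips = [math.floor(x) for x in raw]
    leftover = rows - sum(strips)
    # Hand the remaining rows to the strips with the largest fractional parts.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - strips[i], reverse=True)
    for i in order[:leftover]:
        strips[i] += 1
    return strips


if __name__ == "__main__":
    speeds = [1.0, 1.0, 2.0]   # hypothetical inverse speeds; third node is slower
    alpha = bus_load_fractions(speeds, z=0.2)
    print(alpha, finish_times(alpha, speeds, z=0.2))
    print(strips_from_fractions(alpha, rows=36))
```

    Rounding to whole rows reflects the granularity issue the abstract mentions: the ideal fractions are continuous, but a strip can only contain an integer number of macroblock rows.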

    Fus-MAE: A cross-attention-based data fusion approach for Masked Autoencoders in remote sensing

    Self-supervised frameworks for representation learning have recently stirred up interest among the remote sensing community, given their potential to mitigate the high labeling costs associated with curating large satellite image datasets. In the realm of multimodal data fusion, while the often-used contrastive learning methods can help bridge the domain gap between different sensor types, they rely on data augmentation techniques that require expertise and careful design, especially for multispectral remote sensing data. A possible but rather scarcely studied way to circumvent these limitations is to use a pretraining strategy based on masked image modelling. In this paper, we introduce Fus-MAE, a self-supervised learning framework based on masked autoencoders that uses cross-attention to perform early and feature-level data fusion between synthetic aperture radar (SAR) and multispectral optical data, two modalities with a significant domain gap. Our empirical findings demonstrate that Fus-MAE can effectively compete with contrastive learning strategies tailored for SAR-optical data fusion and outperforms other masked-autoencoder frameworks trained on a larger corpus.

    Towards high performance computing for molecular structure prediction using IBM Cell Broadband Engine - an implementation perspective

    Background: The RNA structure prediction problem is a computationally complex task, especially with pseudo-knots. The problem is well studied in the existing literature and is predominantly solved with highly coupled Dynamic Programming (DP) solutions. The problem scale and complexity become unmanageably large as sequence size increases. This makes the case for parallelization. Parallelization can be achieved by way of networked platforms (clusters, grids, etc.) as well as by using modern multi-core chips. Methods: In this paper, we exploit the parallelism capabilities of the IBM Cell Broadband Engine to parallelize an existing Dynamic Programming (DP) algorithm for RNA secondary structure prediction. We design three different implementation strategies that exploit the inherent code, data and/or hybrid parallelism, referred to as C-Par, D-Par and H-Par, respectively, and analyze their performance. Our approach attempts to introduce parallelism in critical sections of the algorithm. We ran our experiments on a Sony PlayStation 3 (PS3), which is based on the IBM Cell chip. Results: Our results suggest that introducing parallelism in the DP algorithm allows it to easily handle longer sequences, which would otherwise consume a large amount of time on single-core computers. The results further demonstrate the speed-up gained by exploiting the inherent parallelism in the problem and highlight the advantages of using multi-core platforms for designing more sophisticated methodologies for handling fairly long RNA sequences. Conclusion: The speed-up performance reported here is promising, especially when the sequence length is long. To the best of our knowledge, the work reported in this paper is probably the first of its kind to utilize the IBM Cell Broadband Engine (a heterogeneous multi-core chip) to implement a DP algorithm. The results also encourage the use of multi-core platforms for designing more sophisticated methodologies for handling a fairly long RNA sequence to predict its secondary structure.

    On the Design of Mutually Aware Optimal Pricing and Load Balancing Strategies for Grid Computing Systems

    Managing resources and cleverly pricing them on computing systems is a challenging task. Resource sharing demands careful load balancing and often strives to achieve a win-win situation between resource providers and users. Toward this goal, we consider a joint treatment of load balancing and pricing. We do not assume static pricing to determine load balancing, or vice versa. Instead, we study the relationship between the price that a computing node is charged and the load and revenue that it receives. We find that there exists an optimal price which maximizes the revenue. We then consider a multi-user environment and explore how the load from a user can be balanced on processors with existing loads. Finally, we derive an optimal price that maximizes the revenue in the multi-user environment. We evaluate the performance of the proposed algorithms through simulations.

    Multi-GPU design and performance evaluation of homomorphic encryption on GPU clusters

    We present a multi-GPU design, implementation and performance evaluation of the Halevi-Polyakov-Shoup (HPS) variant of the Fan-Vercauteren (FV) levelled Fully Homomorphic Encryption (FHE) scheme. Our design follows a data parallelism approach and uses partitioning methods to distribute the workload in FV primitives evenly across available GPUs. The design is intended to address the space and runtime requirements of FHE computations. It is also suitable for distributed-memory architectures, and includes efficient GPU-to-GPU data exchange protocols. Moreover, it is user-friendly, as user intervention is not required for task decomposition, scheduling or load balancing. We implement and evaluate the performance of our design on two homogeneous and heterogeneous NVIDIA GPU clusters: K80 and a customized P100. We also provide a comparison with a recent shared-memory-based multi-core CPU implementation using two homomorphic circuits as workloads: vector addition and multiplication. Moreover, we use our multi-GPU Levelled-FHE to implement the inference circuit of two Convolutional Neural Networks (CNNs) to perform image classification homomorphically on encrypted images from the MNIST and CIFAR-10 datasets. Our implementation provides 1 to 3 orders of magnitude speedup compared with the CPU implementation on vector operations. In terms of scalability, our design shows reasonable scalability curves when the GPUs are fully connected. This work is supported by A*STAR under its RIE2020 Advanced Manufacturing and Engineering (AME) Programmatic Programme (Award A19E3b0099).

    Implementation and Performance Evaluation of RNS Variants of the BFV Homomorphic Encryption Scheme

    Homomorphic encryption is an emerging form of encryption that provides the ability to compute on encrypted data without ever decrypting them. Potential applications include aggregating sensitive encrypted data in a cloud environment and computing on the data in the cloud without compromising data privacy. There have been several recent advances resulting in new homomorphic encryption schemes and optimized variants. We implement and evaluate the performance of two optimized variants, namely Bajard-Eynard-Hasan-Zucca (BEHZ) and Halevi-Polyakov-Shoup (HPS), of the most promising homomorphic encryption scheme on CPU and GPU. The most interesting (and also unexpected) result of our performance evaluation is that the HPS variant in practice scales significantly better (typically by 15%-30%) with increasing multiplicative depth of the computation circuit than BEHZ, implying that the HPS variant will outperform BEHZ for most practical applications. For a multiplicative depth of 98, our fastest GPU implementation performs homomorphic multiplication in 51 ms for 128-bit security settings, which is faster by two orders of magnitude than prior results and already practical for cloud environments supporting GPU computations. Large multiplicative depths supported by our implementations are required for applications involving deep neural networks, logistic regression learning, and other important machine learning problems.

    Multi-Phase Cross-modal Learning for Noninvasive Gene Mutation Prediction in Hepatocellular Carcinoma

    Hepatocellular carcinoma (HCC) is the most common type of primary liver cancer and the fourth most common cause of cancer-related death worldwide. Understanding the underlying gene mutations in HCC provides great prognostic value for treatment planning and targeted therapy. Radiogenomics has revealed an association between non-invasive imaging features and molecular genomics. However, imaging feature identification is laborious and error-prone. In this paper, we propose an end-to-end deep learning framework for mutation prediction in the APOB, COL11A1 and ATRX genes using multiphasic CT scans. Considering intra-tumour heterogeneity (ITH) in HCC, multi-region sampling technology is implemented to generate the dataset for experiments. Experimental results demonstrate the effectiveness of the proposed model. Comment: Accepted version to be published in the 42nd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBC 2020, Montreal, Canada.

    Reliability and energy-aware mapping and scheduling of multimedia applications on multiprocessor systems

    Lifetime reliability is an emerging concern in multiprocessor systems, as escalating power density, and hence temperature variation, continues to accelerate wear-out, leading to a growing prominence of device defects. In this paper, we propose a system-level approach that involves performance-aware mapping of multimedia applications on a multiprocessor system to jointly minimize energy consumption and temperature-related wear-out. Fundamental to this approach is a simplified temperature model that incorporates not only the transient and the steady-state behavior (temporal effect), but also the temperature dependency on the surrounding cores (spatial effect). This model is validated against the temperature obtained using the HotSpot tool with transient and steady-state simulations, and is shown to be accurate within 5.5 °C, leading to an MTTF estimation accuracy of an average 21% with respect to the state-of-the-art approaches. The proposed temperature model is integrated in a gradient-based fast heuristic that controls the voltage and frequency of the cores to limit the average and peak temperature, leading to a longer lifetime while simultaneously minimizing the energy consumption. Lifetime computation considers task remapping, which is a common feature available in modern multiprocessor systems. A linear programming approach is then proposed to distribute the cores of a multiprocessor system among concurrent applications to maximize the lifetime. Experiments conducted with a set of synthetic and real-life applications represented as synchronous data flow graphs demonstrate that the proposed approach reduces energy consumption by an average of 24% with a 47% increase in lifetime. For concurrent applications, the proposed lifetime-aware core distribution results in an average 10% improvement in lifetime as compared to performance-based core distribution.

    Scheduling Strategies of Divisible Loads in DIN Networks
